🐛 Do not update MS status when unable to get workload cluster or machine node #10436
Conversation
[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

Hi @jessehu. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with `/ok-to-test` on its own line. Once the patch is verified, the new status will be reflected by the `ok-to-test` label.

I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/area machineset

/ok-to-test
```diff
@@ -883,8 +888,7 @@ func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster
 		node, err := r.getMachineNode(ctx, cluster, machine)
 		if err != nil && machine.GetDeletionTimestamp().IsZero() {
-			log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
-			continue
+			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
```
I think this doesn't address:

> For MachineSet specifically: it definitely solves the "happy path", but it makes the "unhappy path" worse by freezing the status indefinitely. Without the changes, the ready and available replicas were dropping to 0 when the communication broke down. It is probably a matter of preference whether, in case of a communication breakdown, we keep ready and available replicas as-is or drop them to 0. But another side effect is that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions).

Specifically:

> "unhappy path" worse by freezing the status indefinitely

> But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions)
Thanks @sbueringer.

> "unhappy path" worse by freezing the status indefinitely

This PR does not update any fields of MS.Status when it is unable to get the workload cluster or a Machine's Node, whether due to ErrClusterLocked or any other error. The ErrClusterLocked error recovers soon after reconciling again, and an error preventing the Machine's Node from being fetched (e.g. a network issue, or the apiserver being temporarily unavailable) should also recover soon after reconciling again. Are there other "unhappy paths" you meant?

> But another side effect is also that we don't update ObservedGeneration anymore, which definitely seems wrong (similar for the conditions).

Since this PR returns the error at line 891, it won't update ms.Status at line 904 or newStatus.ObservedGeneration at line 924. This is as expected; see the sketch below.
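For readers following along, here is a simplified sketch of the updateStatus flow being referenced. It is paraphrased from machineset_controller.go, not the exact source: `isNodeReady` is a hypothetical stand-in for the real readiness check, and the line markers correspond to the numbers quoted above.

```go
// Paraphrased sketch of updateStatus in machineset_controller.go; it only
// illustrates the control flow discussed above, not the verbatim source.
func (r *Reconciler) updateStatus(ctx context.Context, cluster *clusterv1.Cluster, ms *clusterv1.MachineSet, filteredMachines []*clusterv1.Machine) error {
	newStatus := ms.Status.DeepCopy()

	readyReplicasCount := 0
	for _, machine := range filteredMachines {
		node, err := r.getMachineNode(ctx, cluster, machine)
		if err != nil && machine.GetDeletionTimestamp().IsZero() {
			// ~line 891 (this PR): return instead of counting the Machine as
			// not ready; everything below is skipped on error.
			return errors.Wrapf(err, "unable to retrieve the status of Node %s", klog.KObj(node))
		}
		if node != nil && isNodeReady(node) { // isNodeReady: hypothetical stand-in
			readyReplicasCount++
		}
	}
	newStatus.ReadyReplicas = int32(readyReplicasCount)

	// ~lines 904/924: only reached when every Node lookup succeeded, so both
	// ms.Status and its ObservedGeneration are only updated on success.
	newStatus.ObservedGeneration = ms.Generation
	newStatus.DeepCopyInto(&ms.Status)
	return nil
}
```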
> and an error preventing the Machine's Node from being fetched (e.g. a network issue, or the apiserver being temporarily unavailable) should also recover soon after reconciling again. Are there other "unhappy paths" you meant?
Yup, apiserver down for a longer period of time / indefinitely
In case the apiserver goes down for a longer period of time or indefinitely, the KCP controller should catch it and handle it correctly.
KCP won't always be able to recover (if that's what you meant)
Hi @fabriziopandini, as discussed with @sbueringer in the current thread, IMHO this PR provides a simple patch that solves the MS status update issue described in #10195, and there would be no need to introduce the suggested handling code (provided by @sbueringer, and IMHO a little complex):
- I replied with my thoughts in 🐛 Do not update KCP and MS status when unable to get workload cluster #10229 (comment) and also removed the KCP status handling code as you suggested.
- In case the apiserver goes down for a longer period of time or indefinitely, the MS status won't be updated, but the KCP controller should catch it and handle it correctly. In this case updating the MS status won't help.

BTW I was on vacation last week, so sorry for the late reply. Thanks!
I will look into it as soon as possible, but we have already invested a lot of time in trying to find an acceptable way forward.
I tried to express above that it's not acceptable for the MS status to be permanently frozen when a workload cluster stays unreachable.
As I described above:
2) In case the apiserver goes down for a longer period of time or indefinitely, the MS status won't be updated, but the KCP controller should catch it and handle it correctly. In this case updating the MS status won't help.
3) Even if we want to update the MS status in this special case, it would not be good to set the MS ready replicas to 0; instead, IMHO a new status field like UnknownReplicas should be added (but this changes the current API definition and is not a trivial change). A rough shape is sketched below.
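For illustration only, here is a hypothetical shape of such an API change. UnknownReplicas does not exist in the current MachineSetStatus; the field and its comment are invented to make the suggestion concrete.

```go
// Hypothetical sketch only: UnknownReplicas is NOT part of the current
// cluster-api MachineSetStatus. It illustrates reporting "unknown" instead of
// forcing ReadyReplicas to 0 when the workload cluster is unreachable.
type MachineSetStatus struct {
	// Replicas is the most recently observed number of replicas.
	Replicas int32 `json:"replicas"`
	// ReadyReplicas is the number of ready replicas for this MachineSet.
	ReadyReplicas int32 `json:"readyReplicas,omitempty"`
	// AvailableReplicas is the number of available replicas for this MachineSet.
	AvailableReplicas int32 `json:"availableReplicas,omitempty"`
	// UnknownReplicas would count Machines whose Node status could not be
	// determined, e.g. because the workload cluster apiserver is unreachable.
	UnknownReplicas int32 `json:"unknownReplicas,omitempty"`
}
```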
Sure, updating the MS status won't contribute to KCP trying to remediate the problem, but it will actually keep the status up-to-date. The status is much more than just the replica fields.
The Kubernetes project currently lacks enough contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle stale
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale
The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Mark this PR as fresh with /remove-lifecycle rotten
- Close this PR with /close
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages PRs according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:
- Reopen this PR with /reopen
- Mark this PR as fresh with /remove-lifecycle rotten
- Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close
@k8s-triage-robot: Closed this PR.

In response to this:

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
What this PR does / why we need it:
The ErrClusterLocked error causes MD.Status.ReadyReplicas to drop from 3 to 0; after about 90s it changes back to 3. The reason is that updateStatus() in machineset_controller.go ignores the error returned by getMachineNode() and treats the Node as not ready (see the sketch below). Setting MD.Status.ReadyReplicas from 3 to 0 is unreasonable and causes issues in our project built on top of CAPI.
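For context, a paraphrased sketch of the pre-PR loop (not the exact source; `isNodeReady` is a hypothetical stand-in for the real readiness check) showing why a transient lookup failure zeroed the counters:

```go
// Pre-PR behavior (paraphrased sketch): a Node lookup error is only logged and
// the Machine is skipped, so it is never counted as ready or available.
for _, machine := range filteredMachines {
	node, err := r.getMachineNode(ctx, cluster, machine)
	if err != nil && machine.GetDeletionTimestamp().IsZero() {
		log.Error(err, "Unable to retrieve Node status", "node", klog.KObj(node))
		continue // e.g. on ErrClusterLocked every Machine is skipped -> ReadyReplicas becomes 0
	}
	if node != nil && isNodeReady(node) { // isNodeReady: hypothetical stand-in
		readyReplicasCount++
	}
}
```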
Which issue(s) this PR fixes:
Fixes #10195
Changes
Do not update any fields of MS.Status when unable to get the workload cluster or a Machine's Node, whether due to ErrClusterLocked or any other error. The ErrClusterLocked error recovers soon after reconciling again, and an error preventing the Machine's Node from being fetched (e.g. a network issue, or the apiserver being temporarily unavailable) should also recover soon after reconciling again. A sketch of the assumed caller-side handling follows.
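A minimal caller-side sketch (assumed handling, not code from this PR) of how the wrapped error could be treated as transient so the status simply refreshes on the next reconcile; remote.ErrClusterLocked is the sentinel error exported by cluster-api's controllers/remote package:

```go
// Assumed caller-side handling (sketch, not part of this PR): requeue on
// remote.ErrClusterLocked instead of surfacing it as a reconcile failure.
if err := r.updateStatus(ctx, cluster, machineSet, filteredMachines); err != nil {
	if errors.Is(err, remote.ErrClusterLocked) {
		log.V(5).Info("Requeueing because another worker has the lock on the ClusterCacheTracker")
		return ctrl.Result{RequeueAfter: time.Minute}, nil
	}
	return ctrl.Result{}, errors.Wrap(err, "failed to update MachineSet status")
}
```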
Please refer to more discussion on #10229 (comment).
Test:
@sbueringer @fabriziopandini please kindly take a look again. Thanks for your time!